# KNITR MUST BE VERSION 1.42 TO RENDER MAPS
# Library Imports
library('tidyverse')
library('gendercoder')
library('janitor')
library("scales")
library("sf")
library('ggmap')
library('plotly')
library('leaflet')
library('tippy')
library('xfun')
library('stringr')
library('kableExtra')
library('ggpubr')
library('flextable')
library("stringdist")
Code
# Needed to clean names for the inline code. More involved cleaning will be discussed.
raw_df <- readr::read_csv('Data/DATA2x02 survey (2023) (Responses) - Form responses 1.csv') |>
  janitor::clean_names()
1 Introduction
Code
tippy::tippy_this(elementId ="random_sample", tooltip ="When all members of a population have equal likelihood to be sampled.")
Code
tippy::tippy_this(elementId ="wam", tooltip ="Weighted Average Mark")
DATA2X02 is a group of two units – DATA2002 and DATA2902 – offered within the School of Mathematics and Statistics at The University of Sydney. The units teach “advanced data analytic skills for a wide range of problems and data” (The University of Sydney 2023) with a focus on statistical methods to analyse and answer a scientific question.
1.1 Survey Method and Random Sampling
The raw dataset provided was sourced from a cohort survey which aimed to gain insight into the units’ cohort. Despite efforts to encourage student participation in the survey through an Ed Discussion Announcement and multiple reminders in labs and lectures, the response rate was 41%. It is important to note that due to this method of communication, there exists an argument that the survey participants may not have been a random sample of DATA2X02 students.
Students who were less engaged – possibly not attending lectures or labs, or not interacting with the Ed Discussion Board – are considerably less likely to have completed the survey than their counterparts who received multiple prompts. Moreover, those who are more engaged are more likely to take time out of their day to fill out the survey after a reminder. This is evidenced by DATA2902 (the advanced stream of DATA2X02) having a response rate of 71% compared to DATA2002’s rate of 37%. Students could also submit the survey multiple times, which may have skewed the data towards any individual who submitted many different responses. Whilst acknowledging these shortcomings of the sampling method and the subsequent response pattern, it is asserted that the survey still offers a moderately random sample of the DATA2X02 cohort.
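The overall response rate can be reproduced with a short calculation. This is a minimal sketch rather than the report's own code: the enrolment figures (759 students in DATA2X02, made up of 675 in DATA2002 and 84 in DATA2902) are the denominators used in the report's inline R, the 312 responses is an assumption taken from the totals of the summary tables later in the report, and the helper name `response_rate` is hypothetical.

```r
# Hypothetical helper: response rate formatted as a whole-number percentage.
response_rate <- function(n_responses, n_enrolled) {
  sprintf("%.0f%%", 100 * n_responses / n_enrolled)
}

n_responses <- 312        # assumed: number of rows in the raw survey data
n_enrolled  <- 675 + 84   # DATA2002 + DATA2902 enrolments (759 in total)

response_rate(n_responses, n_enrolled)
```

The same helper applies to each unit separately once the responses are counted per unit.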
1.2 Sources of Bias
There are some potential biases that may have occurred during this survey.
Non-response Bias – As discussed in Section 1.1, there may have been a non-response bias within the survey. Specifically, we see a difference in response rates between DATA2902 and DATA2002 students. This may have skewed the sample data towards the population of DATA2902 students, rather than DATA2X02 as a whole. This would be an issue if there is a significant difference between the populations of the two units. This is not out of the question, as those who opt to take an advanced stream of a unit may be more willing to challenge themselves and put more effort into their studies. Moreover, there is the possibility that students do not opt for an advanced unit in order to prioritise other aspects of their life, such as work.
Social desirability/conformity bias – Many of the questions asked in the survey have an associated ‘socially desirable’ answer. For example, students may, whether consciously or unconsciously, overestimate the number of hours they exercise, or underestimate the amount of time they spend on social media, as these answers come with positive social connotations. Moreover, students may want to conform to the expected answer of the population. An example of this may be the question on whether or not students had experience in R coding. The majority of the DATA2X02 cohort would have had experience in R as it was taught in many prerequisite courses, so those who didn’t have experience may answer incorrectly to conform with the rest of the cohort.
Recall Bias – Even if students did not suffer from social desirability or conformity bias, they may have simply not been able to recall the correct answer for a question. An example of this would be someone’s WAM. Many students may not know their actual WAM (as it is not reported when getting results or on the online academic transcript), and so they could incorrectly recall it when answering the survey. An instance of this is seen in the WAMs reported, with three students reporting a WAM of 99 or above, a value that could potentially be less accurate due to difficulties in recall.
1.3 Possible Improvements
There are many possible improvements which would help to generate more useful data. Many of the questions regarding numeric data did not specify the units in which an answer should be given, or whether the units should be included in the answer. This can be changed by specifying units in the question and only allowing numeric data to be input into the survey rather than free text. One such question was How much sleep do you get (on avg, per day)?. A better wording of this question would be How much sleep (in hours per night) do you get on average?. This was also an issue for the question How tall are you?, where answers were not given in a uniform manner. Rewording to How tall are you in cm? would have produced data which required much less cleaning. This extends to What is your shoe size?, where students responded with both US and European shoe sizes, which are on very different scales (a US 10 is a 43 European).
There were also issues regarding the categorical data. The question Would you prefer to study at Fisher Library or SciTech Library? did not need to include an Other response, as any answer of this type would not be answering the question asked. Moreover, the question Do you work? did not align with the suggested responses given. This question should have been What is your current employment status?. A similar issue was seen in the question Do you submit assignments on time?, which should have been How often do you submit assignments on time?. Finally, some questions could have included some options and an Other response, rather than free text. This was a particular issue for What brand is your laptop? and What is your favourite social media platform?, where students gave answers in many different forms when referring to the same category, e.g. Apple and Macbook being the same laptop brand. Providing some pre-defined answers would reduce the need for data cleaning.
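To illustrate the cleaning burden that free-text numeric answers create, the following is a minimal sketch of how height responses might be standardised to centimetres. The function name `clean_height`, the example answers, and the rule that values below 3 are metres are all assumptions made for illustration; this is not the cleaning used in the report.

```r
# Hypothetical cleaner: strip non-numeric characters, then treat small
# values as metres and convert them to centimetres.
# Limitation: answers in feet and inches (e.g. 5'11) are not handled.
clean_height <- function(x) {
  value <- suppressWarnings(as.numeric(gsub("[^0-9.]", "", x)))
  ifelse(!is.na(value) & value < 3, value * 100, value)
}

# Made-up example answers showing typical free-text variation
clean_height(c("175", "1.8m", "180 cm", "tall"))
```

A question worded as How tall are you in cm? with a numeric-only input would make a step like this unnecessary.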
1.4 Report Outline
This report will focus on the geographical characteristics of the cohort, with the Postcode of each response being used as a proxy for where a student lives. Specifically, hypothesis testing will be used to determine the impact of a student’s geographical region on a variety of variables.
SA4s are the “largest sub-State regions” and “represent labour markets or groups of labour markets within each State and Territory” (Australian Bureau of Statistics 2021), with each SA4 having approximately 300,000 to 500,000 residents in metropolitan areas. These regions will be used to group students into geographical areas with ‘geographical, social and economic similarities’ (Australian Bureau of Statistics 2021). Figure 1 is a map made using Leaflet (Cheng, Karambelkar, and Xie 2023) which showcases the SA4s of Greater Sydney.
Data cleaning was done in R (R Core Team 2023) using the tidyverse packages (Wickham et al. 2019). The janitor package (Firke 2023) was initially used to help standardise the names of each column so that a reproducible introduction could be made. A new naming convention for the columns was adopted based on Tarr (2023). Some summary tables have also been created using gt (Iannone et al. 2023).
Column Name Conversion Table
Code
raw_df <- readr::read_csv('Data/DATA2x02 survey (2023) (Responses) - Form responses 1.csv')
old_names = colnames(raw_df)
df <- raw_df
new_names = c("timestamp", "n_units", "task_approach", "age", "life",
              "fass_unit", "fass_major", "novel", "library", "private_health",
              "sugar_days", "rent", "post_code", "haircut_days", "laptop_brand",
              "urinal_position", "stall_position", "n_weetbix", "food_budget",
              "pineapple", "living_arrangements", "height", "uni_travel_method",
              "feel_anxious", "study_hrs", "work", "social_media", "gender",
              "sleep_time", "diet", "random_number", "steak_preference",
              "dominant_hand", "normal_advanced", "exercise_hrs", "employment_hrs",
              "on_time", "used_r_before", "team_role", "social_media_hrs",
              "uni_year", "sport", "wam", "shoe_size")
# overwrite the old names with the new names:
colnames(df) = new_names
# combine old and new into a data frame:
name_combo = bind_cols(`New Names` = new_names, `Original Names` = old_names)
name_combo |>
  gt::gt() |>
  gt::tab_header(title = "Column Name Cleaning") |>
  gt::tab_options(heading.title.font.weight = 'bolder',
                  column_labels.font.weight = 'bold')
Column Name Cleaning

| New Names | Original Names |
|---|---|
| timestamp | Timestamp |
| n_units | How many units are you enrolled in this semester? |
| task_approach | When it comes to assignments / due tasks do you: |
| age | How old are you? |
| life | Do you tend to lean towards saying "yes" or towards saying "no" to things throughout life? |
| fass_unit | Have you taken one or more units of study from the Faculty of Arts and Social Sciences? |
| fass_major | Are you completing a major or minor in a subject area from the Faculty of Arts and Social Sciences? |
| novel | Have you read a novel this year? |
| library | Would you prefer to study at Fisher Library or SciTech Library? |
| private_health | Do you have private health insurance? |
| sugar_days | How many days in a week you normally consume sweets/chocolates/sugary drinks? (Exclude Diet/Sugar Free Drinks & sweets)? |
| rent | Do you pay rent? |
| post_code | What is your post code? |
| haircut_days | How many days do you go between haircuts (on average)? |
| laptop_brand | What brand is your laptop? |
| urinal_position | You enter a public bathroom and find you're the only one there. There are three urinals on the wall for you to choose from. Which do you choose? |
| stall_position | You enter a public bathroom and there are three stalls to choose from. All three are unoccupied. Which do you choose? |
| n_weetbix | How many Weet-Bix would you typically eat in one sitting? |
| food_budget | What is the average amount of money you spend each week on food/beverages? |
| pineapple | Do you like pineapple on pizza? |
| living_arrangements | What are your current living arrangements? |
| height | How tall are you? |
| uni_travel_method | How do you get to university? |
| feel_anxious | How often would you say you feel anxious on a daily basis? |
| study_hrs | How many hours a week do you spend studying? |
| work | Do you work? |
| social_media | What is your favourite social media platform? |
| gender | What is your gender? |
| sleep_time | How much sleep do you get (on avg, per day)? |
| diet | What is your diet style? |
| random_number | Pick a number at random between 0 and 9 |
| steak_preference | How do you like your steak cooked? |
| dominant_hand | What is your dominant hand? |
| normal_advanced | Which unit are you enrolled in? |
| exercise_hrs | On average, how many hours each week do you spend exercising? |
| employment_hrs | How many hours a week (on average) do you work in paid employment? |
| on_time | Do you submit assignments on time? |
| used_r_before | Have you ever used R before starting DATA2x02? |
| team_role | What kind of role (active or passive) do you think you are when working as part of a team? |
| social_media_hrs | How many hours do you spend on social media per day? |
| uni_year | Which year of university are you currently in? |
| sport | Which sports do you play most often? |
| wam | What is your WAM? |
| shoe_size | What is your shoe size? |
The SA4 name of each row was also joined onto the survey data using a reference table made by Proctor (2023).
Code
sa4_postcode_df <- readr::read_csv('Data/sa4_postcode.csv') |>
  select(c(`Postcode`, `SA4 Name`)) |>
  unique() |>
  # postcode 2232 appears under two SA4s; drop one so each postcode maps to a single SA4
  filter(!((`Postcode` == 2232) & (`SA4 Name` == 'Southern Highlands and Shoalhaven')))
colnames(sa4_postcode_df) <- c('post_code', 'sa4_name')
sa4_postcode_df$post_code <- as.character(sa4_postcode_df$post_code)

# keep only the digits of each free-text postcode before joining
df$post_code <- as.character(gsub("[^0-9]", "", df$post_code))
df <- df |>
  left_join(sa4_postcode_df, by = 'post_code')

df |>
  count(sa4_name) |>
  arrange(desc(n)) |>
  gt::gt() |>
  gt::cols_label(sa4_name = "SA4 Name", n = 'Count of Students') |>
  gt::tab_header(title = "Count of Students by SA4") |>
  gt::tab_options(heading.title.font.weight = 'bolder',
                  column_labels.font.weight = 'bold')
Count of Students by SA4

| SA4 Name | Count of Students |
|---|---|
| Sydney - City and Inner South | 123 |
| Sydney - Inner West | 35 |
| NA | 31 |
| Sydney - North Sydney and Hornsby | 30 |
| Sydney - Ryde | 18 |
| Sydney - Inner South West | 16 |
| Sydney - Parramatta | 14 |
| Sydney - Northern Beaches | 11 |
| Sydney - Eastern Suburbs | 9 |
| Sydney - Blacktown | 6 |
| Sydney - Outer West and Blue Mountains | 6 |
| Sydney - South West | 5 |
| Sydney - Baulkham Hills and Hawkesbury | 2 |
| Sydney - Outer South West | 2 |
| Sydney - Sutherland | 2 |
| Central Coast | 1 |
| Riverina | 1 |
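The postcode cleaning and lookup join can be sketched on toy data. This is an illustration in base R only, so that it runs without extra packages (the report itself uses dplyr's `left_join`); the survey answers and the three-row lookup table below are made up, and the postcode-to-SA4 pairs are illustrative rather than taken from the reference table.

```r
# Toy survey responses with free-text postcodes (made-up values)
survey <- data.frame(
  id = 1:4,
  post_code = c("2000", " 2150", "NSW 2113", "unknown"),
  stringsAsFactors = FALSE
)

# Toy lookup table (illustrative pairs only)
lookup <- data.frame(
  post_code = c("2000", "2150", "2113"),
  sa4_name = c("Sydney - City and Inner South",
               "Sydney - Parramatta",
               "Sydney - Ryde"),
  stringsAsFactors = FALSE
)

# Keep digits only, as in the report's gsub() call
survey$post_code <- gsub("[^0-9]", "", survey$post_code)

# all.x = TRUE keeps unmatched responses, which receive NA for sa4_name
joined <- merge(survey, lookup, by = "post_code", all.x = TRUE)
joined <- joined[order(joined$id), ]
table(joined$sa4_name, useNA = "ifany")
```

Responses whose postcode cannot be matched stay in the data with a missing SA4, which is how the NA row in the count table arises.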
The SA4s were further grouped together geographically to collapse some of the groups with lower student counts. Figure 2 is a map of the groupings of SA4s into regions. A conversion table was generated using flextable (Gohel and Skintzos 2023).
SA4 to Region Conversion Table
Code
north_sydney = c('Sydney - North Sydney and Hornsby', 'Sydney - Ryde',
                 'Sydney - Northern Beaches')
city_and_eastern_suburbs = c('Sydney - City and Inner South', 'Sydney - Eastern Suburbs')
inner_west = c('Sydney - Inner West', 'Sydney - Parramatta', 'Sydney - Inner South West')
# The next two vectors document the remaining SA4s; they are not referenced
# below because the catch-all !is.na(sa4_name) branch assigns them.
outer_south_west = c('Sydney - Blacktown', 'Sydney - South West', 'Sydney - Sutherland',
                     'Sydney - Outer West and Blue Mountains', 'Sydney - Outer South West')
riverina_and_central_coast = c('Sydney - Baulkham Hills and Hawkesbury', 'Central Coast',
                               'Riverina')

df <- df |>
  mutate(geographic_regions = case_when(
    sa4_name %in% north_sydney ~ 'North Sydney',
    sa4_name %in% city_and_eastern_suburbs ~ 'City and Eastern Suburbs',
    sa4_name %in% inner_west ~ 'Inner West',
    !is.na(sa4_name) ~ 'Outer South West, Greater Sydney and Regional NSW',
    TRUE ~ NA
  ))

mapping_df <- df |>
  select(geographic_regions, sa4_name) |>
  unique() |>
  drop_na() |>
  arrange(geographic_regions) |>
  mutate(`Region` = geographic_regions, `SA4 Name` = sa4_name) |>
  select(Region, `SA4 Name`)

flextable(mapping_df) |>
  merge_v() |>
  theme_vanilla() |>
  width(2, 4) |>
  width(1, 2)
Figure 2: Map of SA4s grouped into Regions for students in DATA2X02
A flagging column was made that identified whether a student travelled to university by car.
Code
df <- df |>
  mutate(car_flag = ifelse(str_detect(uni_travel_method, "Car"), "Drive",
                           ifelse(is.na(uni_travel_method), NA, "Other")))

df |>
  count(car_flag) |>
  gt::gt() |>
  gt::cols_label(car_flag = "Does the Student Drive to University?",
                 n = 'Count of Students') |>
  gt::tab_header(title = "Count of Students by Whether or Not they Travel by Car") |>
  gt::tab_options(heading.title.font.weight = 'bolder',
                  column_labels.font.weight = 'bold')

Count of Students by Whether or Not they Travel by Car

| Does the Student Drive to University? | Count of Students |
|---|---|
| Drive | 66 |
| Other | 242 |
| NA | 4 |
The employment hours of each respondent were binned into categories.
Code
bin_ranges <- c(0, 1, 10, Inf)
bin_labels <- c("0", "1-10", "11+")
# Create a new column with binned values.
# Note: cut() uses right-closed intervals, so with include.lowest = TRUE the
# bins are [0, 1], (1, 10] and (10, Inf); a response of exactly 1 hour is
# therefore labelled "0".
df$employment_hrs_bin <- cut(df$employment_hrs, breaks = bin_ranges,
                             labels = bin_labels, include.lowest = TRUE)

df |>
  count(employment_hrs_bin) |>
  gt::gt() |>
  gt::cols_label(employment_hrs_bin = "Employment Hours", n = 'Count of Students') |>
  gt::tab_header(title = "Count of Students by Employment Hours") |>
  gt::tab_options(heading.title.font.weight = 'bolder',
                  column_labels.font.weight = 'bold')
Count of Students by Employment Hours

| Employment Hours | Count of Students |
|---|---|
| 0 | 144 |
| 1-10 | 72 |
| 11+ | 79 |
| NA | 17 |
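How `cut()` treats bin boundaries is easy to misread, so the sketch below applies the report's breaks to a few made-up values and compares them with an alternative set of breaks. The alternative (`c(-Inf, 0, 10, Inf)`) is an assumption about what the labels are meant to express, not something the report uses.

```r
hours <- c(0, 0.5, 1, 2, 10, 11)   # made-up example values
bin_labels <- c("0", "1-10", "11+")

# Breaks as in the report: [0, 1], (1, 10], (10, Inf)
as_in_report <- cut(hours, breaks = c(0, 1, 10, Inf),
                    labels = bin_labels, include.lowest = TRUE)

# Alternative breaks: (-Inf, 0], (0, 10], (10, Inf)
alternative <- cut(hours, breaks = c(-Inf, 0, 10, Inf), labels = bin_labels)

data.frame(hours, as_in_report, alternative)
```

The two versions differ only for values strictly between 0 and 1 and for exactly 1, so whether the choice matters depends on how many students reported such values.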
Outlying WAM values were set to NA, as these may come from international students who have a different WAM system or from people who do not know their WAM. A value was judged to be an outlier if it lay more than 3 standard deviations from the mean (outside \(\pm 3\) standard deviations).
Figure 3: Histogram of students’ WAM with outliers removed
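The outlier rule described above can be sketched as follows. The report does not show its own implementation, so this is one plausible version under stated assumptions: the function name `remove_outliers` is hypothetical, the WAM values are made up, and the mean and standard deviation are computed once on all non-missing values.

```r
# Hypothetical helper: set values more than k standard deviations
# from the mean to NA.
remove_outliers <- function(x, k = 3) {
  centre <- mean(x, na.rm = TRUE)
  spread <- sd(x, na.rm = TRUE)
  x[!is.na(x) & abs(x - centre) > k * spread] <- NA
  x
}

# Made-up WAMs with one implausible entry (5)
wam <- c(rep(c(68, 72), 10), 75, 65, 5)
cleaned <- remove_outliers(wam)
sum(is.na(cleaned))   # number of values flagged as outliers
```

Because an extreme value inflates the standard deviation it is judged against, this rule is conservative in small samples; a median-based rule would be a possible alternative.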
2 Hypothesis Testing
2.1 Does Living in Sydney’s City and Eastern Suburbs Influence if Students Drive to University?
Given that the University of Sydney is located in Sydney’s City and Eastern Suburbs region, it is suspected that students who live close to the university may opt for public transport rather than driving. This is of interest because the university’s effective carbon emissions could be reduced if more students used public transport.
Code
car_df <- df |>
  select(c(geographic_regions, car_flag)) |>
  mutate(geographic_regions = ifelse(geographic_regions == 'City and Eastern Suburbs',
                                     'City and Eastern Suburbs', 'Other')) |>
  drop_na() |>
  mutate(`Travel Method` = car_flag)

car_df |>
  ggplot() +
  aes(x = geographic_regions, fill = `Travel Method`) +
  geom_bar(colour = "black", # Creates a proportion bar chart
           linewidth = 0.5,
           position = "fill") +
  labs(y = "Proportion of Travel Method", # Changes the axis label and title
       x = "Region",
       title = "Proportion of Students who drive to University \n based on Geographical Location",
       fill = "Travel Method") +
  theme(plot.background = element_rect(fill = "#ffffff", # Changes the aesthetics of the chart
                                       linewidth = 0),
        legend.background = element_rect(fill = "#ffffff", linewidth = 0),
        panel.border = element_rect(colour = "black", fill = NA),
        legend.box.background = element_rect(colour = "black"),
        axis.title = element_text(face = "bold"),
        plot.title = element_text(face = "bold", size = 14, hjust = 0.5)) +
  scale_y_continuous(labels = scales::percent) +
  scale_fill_brewer(palette = "Set2")
Figure 4: Proportion bar chart of travel method for different regions
A \(\chi^2\)-test for independence was performed at the \(\alpha = 0.05\) level on the contingency table below. A Monte Carlo simulation with \(6000\) replicates was used to calculate the \(p\)-value.
Code
contingency_table <- table(car_df$geographic_regions, car_df$car_flag) |>
  as.data.frame.matrix()
contingency_table$`Region` = c('City and Eastern Suburbs', 'Other')

contingency_table |>
  gt::gt() |>
  gt::cols_move_to_start(columns = c(`Region`)) |>
  gt::tab_spanner(label = "Method of Travel", columns = 1:2) |>
  gt::tab_header(title = "Count of Students by Method of Travel") |>
  gt::tab_options(heading.title.font.weight = 'bolder',
                  column_labels.font.weight = 'bold')
Hypothesis – \(H_0\): The method of travel of a student is independent of whether they live in Sydney’s City and Eastern Suburbs. \(H_1\): The method of travel of a student depends on whether they live in Sydney’s City and Eastern Suburbs.
Assumptions – The observations are independent, and the expected cell counts are all greater than or equal to 5. Independence is assumed on the basis that each response comes from a different student, although Section 1.1 notes that repeat submissions were possible. No expected cell count was less than 5, so the second assumption holds.
Test Statistic – \[T = \sum_{i=1}^2 \sum_{j=1}^2 \frac{\left(Y_{i j}-e_{i j}\right)^2}{e_{i j}}\] Under \(H_0\), \(T\sim \chi^2_1\).
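For reference, the expected counts \(e_{ij}\) in the statistic above are the counts implied by independence of rows and columns. The row-total, column-total and sample-size symbols below are notation introduced here for illustration and do not appear elsewhere in the report:

\[e_{ij} = \frac{Y_{i\bullet}\, Y_{\bullet j}}{n}, \qquad Y_{i\bullet} = \sum_{j} Y_{ij}, \quad Y_{\bullet j} = \sum_{i} Y_{ij}, \quad n = \sum_{i}\sum_{j} Y_{ij}\]

These are the quantities that the assumption of expected cell counts of at least 5 refers to.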
Observed Test Statistic – \(t_0=\) 20.3285.
p-value – The proportion of simulated test statistics that were as or more extreme than \(t_0\) was \(p=\) 0.00017.
Decision – As the \(p\)-value was \(<\alpha\), we reject \(H_0\). This implies that there is some dependence between method of travel and living in Sydney’s City and Eastern Suburbs.
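The code for this test is not shown in the report, so the following is a minimal sketch of how a Monte Carlo \(\chi^2\)-test with 6000 replicates could be run in R. The counts are hypothetical round numbers chosen for illustration (they are not the survey counts), and the seed is arbitrary.

```r
# Hypothetical 2 x 2 table of counts (NOT the survey data)
counts <- matrix(c(10, 90,
                   40, 60),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(Region = c("City and Eastern Suburbs", "Other"),
                                 Travel = c("Drive", "Other")))

set.seed(2023)  # arbitrary seed so the simulated p-value is reproducible
mc_test <- chisq.test(counts, simulate.p.value = TRUE, B = 6000)

# Pearson statistic computed by hand from the expected counts
expected <- outer(rowSums(counts), colSums(counts)) / sum(counts)
manual_statistic <- sum((counts - expected)^2 / expected)

mc_test$statistic
manual_statistic
mc_test$p.value
```

With simulation, `chisq.test` reports the uncorrected Pearson statistic and estimates the p-value as (1 + number of simulated statistics at least as large as the observed one) / (B + 1). The smallest attainable value with B = 6000 is therefore 1/6001, roughly 0.00017, which is consistent with the p-value reported above.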
2.2 Is academic performance significantly better for students living in North Sydney compared to those in the Inner West?
A student’s WAM is one measure of academic performance. Knowing if WAM is impacted by where students live could be useful to know, as it could allow the University to provide targeted academic help.
A Welch two-sample one-sided \(t\)-test at the \(\alpha = 0.05\) level was conducted to determine if the mean WAM of students in North Sydney is larger than that of students living in the Inner West. Initial EDA suggests this may be the case, with mean WAMs of 76.4 and 74.1 respectively. A QQ-plot of students’ WAM was also generated; the points in each group lie close to the reference line, which suggests that WAM is approximately normally distributed.
Code
# wam_df, north_sydney_wam and inner_west_wam are assumed to be created in an
# earlier chunk that is not shown here.
wam_df |>
  ggplot() +
  aes(y = wam, color = geographic_regions, x = geographic_regions,
      fill = geographic_regions) +
  geom_boxplot() +
  scale_color_manual(values = c('darkgreen', 'darkred')) +
  scale_fill_manual(values = c(rgb(217/255, 227/255, 215/255),
                               rgb(230/255, 215/255, 214/255))) +
  theme(legend.position = "none") +
  theme(plot.background = element_rect(fill = "#ffffff", # Changes the aesthetics of the chart
                                       linewidth = 0),
        panel.border = element_rect(colour = "black", fill = NA),
        legend.box.background = element_rect(colour = "black"),
        axis.title = element_text(face = "bold"),
        plot.title = element_text(face = "bold", size = 14, hjust = 0.5)) +
  labs(y = "WAM", # Changes the axis label and title
       x = "Region",
       title = "A: Grouped Box Plot of WAM by Region")

ggqqplot(wam_df, x = "wam", facet.by = "geographic_regions",
         color = "geographic_regions", palette = c('darkgreen', 'darkred'),
         legend = 'none', title = "B: QQ-plot of WAM") +
  theme(plot.background = element_rect(fill = "#ffffff", # Changes the aesthetics of the chart
                                       linewidth = 0),
        panel.border = element_rect(colour = "black", fill = NA),
        legend.box.background = element_rect(colour = "black"),
        axis.title = element_text(face = "bold"),
        plot.title = element_text(face = "bold", size = 14, hjust = 0.5))

# Welch test (t.test() does not assume equal variances by default)
test <- t.test(north_sydney_wam, inner_west_wam, alternative = 'greater')
shapiro1 <- shapiro.test((wam_df |> filter(geographic_regions == 'Inner West'))$wam)
shapiro2 <- shapiro.test((wam_df |> filter(geographic_regions == 'North Sydney'))$wam)
degrees_of_freedom <- test$parameter
Figure 5: A: Box plot of WAMs of students from the Inner West and North Sydney
Figure 6: B: QQ-plot of WAMs of students from the Inner West and North Sydney
Welch two-sample one-sided \(t\)-test
Hypothesis – \(H_0\): The mean WAM of students from North Sydney, \(\mu_{NS}\), equals the mean WAM of students from the Inner West, \(\mu_{IW}\). \(H_1\): \(\mu_{NS}\) is greater than \(\mu_{IW}\).
Assumptions – The observations within each group are independently and identically distributed as \(\mathcal{N}(\mu_{i}, \sigma_{i}^2)\) for \(i=NS, IW\), and the two groups are independent of each other. Independence is assumed on the basis that each response comes from a different student, although Section 1.1 notes that repeat submissions were possible. The QQ-plot above (Figure 6) suggests that WAM is approximately normally distributed in each group. A Shapiro-Wilk test of \(X\sim\mathcal{N}(\mu_{i}, \sigma_{i}^2)\) was also run for each group, with reported p values of 0 for Inner West and 0 for North Sydney. If these values are accurate rather than a rounding or rendering artefact, they indicate a departure from exact normality, in which case the test relies on the approximate normality of the sample means.
Test Statistic – \[T=\frac{\overline{NS}-\overline{IW}}{\sqrt{\frac{S_{ns}^2}{n_{ns}}+\frac{S_{iw}^2}{n_{iw}}}}\] Here, \(S_{ns}^2\) and \(S_{iw}^2\) are the sample variance of the \(NS\) (North Sydney) and \(IW\) (Inner West) samples. Under \(H_0\), \(T\sim t_{\nu}\), where \(\nu=\) 104.47 as estimated from the data.
Decision – As the \(p\)-value was \(<\alpha\), we reject \(H_0\). This implies that the mean WAM of students from North Sydney is significantly greater than that of students who live in the Inner West.
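The Welch test and its estimated degrees of freedom can be checked by hand. The sketch below uses simulated WAMs, because the per-student values are not reproduced in this report; the group sizes, means and standard deviations are assumptions chosen for illustration only, and the formula for \(\nu\) is the standard Welch-Satterthwaite approximation.

```r
set.seed(1)  # arbitrary seed for reproducibility
# Simulated data (NOT the survey responses)
north_sydney_wam <- rnorm(50, mean = 76, sd = 8)
inner_west_wam   <- rnorm(60, mean = 74, sd = 9)

# One-sided Welch test; t.test() uses var.equal = FALSE by default
welch <- t.test(north_sydney_wam, inner_west_wam, alternative = "greater")

# Squared standard errors of each sample mean: S^2 / n
se2_ns <- var(north_sydney_wam) / length(north_sydney_wam)
se2_iw <- var(inner_west_wam) / length(inner_west_wam)

# Test statistic, matching the formula for T above
t_manual <- (mean(north_sydney_wam) - mean(inner_west_wam)) / sqrt(se2_ns + se2_iw)

# Welch-Satterthwaite degrees of freedom
nu_manual <- (se2_ns + se2_iw)^2 /
  (se2_ns^2 / (length(north_sydney_wam) - 1) +
   se2_iw^2 / (length(inner_west_wam) - 1))

c(t = t_manual, nu = nu_manual)
welch
```

Because \(\nu\) depends on the sample variances, it is generally not an integer, which is why the report quotes a fractional value.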
2.3 Does a student’s Region have a significant influence on how many hours they work?
Initial exploration of the data set suggested that there was a non-uniform distribution of working hours across different regions. The proportion of students working between one and 10 hours per week was relatively similar across regions, and the main differences were observed in the proportions of students working no hours or 11 or more hours a week.
Code
employment_df <- df |>
  select(c(geographic_regions, employment_hrs_bin)) |>
  drop_na() |>
  mutate(`Employment Hours per Week` = employment_hrs_bin)

employment_df |>
  ggplot() +
  aes(x = geographic_regions, fill = `Employment Hours per Week`) +
  geom_bar(colour = "black", # Creates a proportion bar chart
           linewidth = 0.5,
           position = "fill") +
  labs(y = "Proportion of Hours Worked Category", # Changes the axis label and title
       x = "Region",
       title = "Proportion of Students in Hours Worked by Region",
       fill = "Employment Hours per Week") +
  theme(plot.background = element_rect(fill = "#ffffff", # Changes the aesthetics of the chart
                                       linewidth = 0),
        legend.background = element_rect(fill = "#ffffff", linewidth = 0),
        panel.border = element_rect(colour = "black", fill = NA),
        legend.box.background = element_rect(colour = "black"),
        axis.title = element_text(face = "bold"),
        plot.title = element_text(face = "bold", size = 14, hjust = 0.5)) +
  scale_y_continuous(labels = scales::percent) +
  scale_x_discrete(labels = c("City and \n Eastern Suburbs", "Inner West", "North Sydney",
                              "Outer South West, \n Greater Sydney and \n Regional NSW")) +
  scale_fill_brewer(palette = "Set2")
Figure 7: Proportion bar chart of hours worked for different regions.
A \(\chi^2\)-test for independence was performed at the \(\alpha = 0.05\) level on the contingency table below. Although `chisq.test` enables Yates’s continuity correction by default, the correction only applies to 2×2 tables, so it has no effect on this 4×3 table.
Code
contingency_table <- table(employment_df$geographic_regions,
                           employment_df$employment_hrs_bin) |>
  as.data.frame.matrix()
contingency_table$`Region` = c('City and Eastern Suburbs', 'Inner West', 'North Sydney',
                               'Outer South West,\n Greater Sydney and \n Regional NSW')

contingency_table |>
  gt::gt() |>
  gt::cols_move_to_start(columns = c(`Region`)) |>
  gt::tab_spanner(label = "Hours Worked", columns = 1:3) |>
  gt::tab_header(title = "Count of Students by Hours Worked") |>
  gt::tab_options(heading.title.font.weight = 'bolder',
                  column_labels.font.weight = 'bold')

Count of Students by Hours Worked

| Region | Hours Worked: 0 | Hours Worked: 1-10 | Hours Worked: 11+ |
|---|---|---|---|
| City and Eastern Suburbs | 82 | 27 | 19 |
| Inner West | 34 | 15 | 14 |
| North Sydney | 15 | 15 | 27 |
| Outer South West, Greater Sydney and Regional NSW | 6 | 8 | 11 |
Code
test <- chisq.test(table(employment_df$geographic_regions,
                         employment_df$employment_hrs_bin))
degrees_of_freedom <- test$parameter
\(\chi^2\)-test for independence
Hypothesis – \(H_0\): The number of hours worked by a student is independent of their region. \(H_1\): The number of hours worked by a student depends on their region.
Assumptions – The observations are independent, and the expected cell counts are all greater than or equal to 5. Independence is assumed on the basis that each response comes from a different student, although Section 1.1 notes that repeat submissions were possible. No expected cell count was less than 5, so the second assumption holds.
Test Statistic – \[T = \sum_{i=1}^4 \sum_{j=1}^3 \frac{\left(Y_{i j}-e_{i j}\right)^2}{e_{i j}}\] where \(i\) indexes the four regions and \(j\) the three hours-worked categories. Under \(H_0\), \(T\sim \chi^2_{6}\) approximately.
Observed Test Statistic – \(t_0=\) 35.8235.
p-value – \(p=P(\chi^2_{6} \geq t_0)<0.001\).
Decision – As the \(p\)-value was \(<\alpha\), we reject \(H_0\). This implies that there is some dependence between hours worked in a week and a student’s region.
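Because the full contingency table is printed above, this test can be re-run directly from the published counts. The sketch below is a check under the assumption that the printed table is exactly the table passed to `chisq.test` in the report; the object names are chosen here for illustration.

```r
# Counts copied from the "Count of Students by Hours Worked" table
hours_by_region <- matrix(
  c(82, 27, 19,
    34, 15, 14,
    15, 15, 27,
     6,  8, 11),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    Region = c("City and Eastern Suburbs", "Inner West", "North Sydney",
               "Outer South West, Greater Sydney and Regional NSW"),
    Hours = c("0", "1-10", "11+")
  )
)

result <- chisq.test(hours_by_region)

result$statistic          # observed test statistic
result$parameter          # degrees of freedom: (4 - 1) * (3 - 1)
min(result$expected)      # smallest expected cell count
result$p.value < 0.001    # compare with the reported p-value bound
```

Run on these counts, the statistic should agree with the reported \(t_0\) on 6 degrees of freedom, and the smallest expected count (the 1-10 hours cell for the outer region) stays above 5, which supports the expected-count assumption.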
3 Conclusion
The geographic characteristics of the cohort have been investigated in this report by grouping DATA2X02 students into regions and performing hypothesis tests on several variables.
Throughout the analysis, it was seen that geographic region played a statistically significant role in the distribution of Travel Method, WAM and Employment Hours per Week. Specifically, it was found that the method of travel of students is dependent on whether or not they live in Sydney’s City and Eastern Suburbs, the mean WAM of students from North Sydney is greater than that of students living in the Inner West, and employment hours per week are dependent on region.
Future investigation into DATA2X02 cohorts may look to validate these results (to see if they are consistent across all DATA2X02 cohorts or specific to the 2023 cohort), as well as source more specific geographical information about students, rather than using their postcode.
Cheng, Joe, Bhaskar Karambelkar, and Yihui Xie. 2023. Leaflet: Create Interactive Web Maps with the JavaScript ’Leaflet’ Library. https://CRAN.R-project.org/package=leaflet.
R Core Team. 2023. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the tidyverse.” Journal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.
---title: "A Statistical Investigation into the Geographic Characteristics of the DATA2X02 Cohort"date: "`r format(Sys.time(), '%d %B, %Y')`"author: "Harry Breden"format: html: theme: united embed-resources: true # Creates a single HTML file as output code-fold: true # Code folding; allows you to show/hide code chunks code-tools: true # Includes a menu to download the code file # code-tools are particularly important if you use inline R to # improve the reproducibility of your reporttable-of-contents: true # (Optional) Creates a table of contentsnumber-sections: true # (Optional) Puts numbers next to heading/subheadingsbibliography: bibliography.bibpage-layout: fullsidebar-width: 0pxfig-align: center---```{r message=FALSE}# KNITR MUST BE VERSION 1.42 TO RENDER MAPS#Library Importslibrary('tidyverse')library('gendercoder')library('janitor')library("scales")library("sf")library('ggmap')library('plotly')library('leaflet')library('tippy')library('xfun')library('stringr')library('kableExtra')library('ggpubr')library('flextable')library("stringdist")``````{r, message=FALSE}#Needed to clean names for the inline code. More involved cleaning will be discussed.raw_df <- readr::read_csv('Data/DATA2x02 survey (2023) (Responses) - Form responses 1.csv') |> janitor::clean_names()```# Introduction```{r}tippy::tippy_this(elementId ="random_sample", tooltip ="When all members of a population have equal likelihood to be sampled.")tippy::tippy_this(elementId ="wam", tooltip ="Weighted Average Mark")```DATA2X02 is a group of two units -- [DATA2002](https://www.sydney.edu.au/units/DATA2002) and [DATA2902](https://www.sydney.edu.au/units/DATA2902) -- offered within the School of Mathematics and Statistics at The University of Sydney. 
The units teach "advanced data analytic skills for a wide range of problems and data" [@DATA2902] with a focus on statistical methods to analyse and answer a scientific question.## Survey Method and Random Sampling {#sec-rs}The [raw dataset](https://docs.google.com/spreadsheets/d/e/2PACX-1vR9Ve_Zi-dM5K96ku2tnmaMrpX3Gk3y9KHcYsNIkzyyna8tRWOxBt_iDsZI_UzXFHLidPU6vY9bml4n/pub?output=csv) provided was sourced from a [cohort survey](https://pages.github.sydney.edu.au/DATA2002/2023/extra/DATA2x02_survey_2023.pdf) which aimed to gain insight into the units' cohort. Despite efforts to encourage student participation in the survey through an Ed Discussion Announcement and multiple reminders in labs and lectures, the response rate was `r scales::percent(nrow(raw_df)/759)`. It is important to note that due to this method of communication, there exists an argument that the survey participants may not have been a [[random sample]{style="text-decoration: underline;"}]{id='random_sample'} of DATA2X02 students. Students who were less engaged -- possibly not attending lectures, labs or interacting with the Ed Discussion Board -- are considerably less likely to have completed the survey compared to their counterparts who received multiple prompts. Moreover, those who are more engaged are likely to take time out of their day to fill out the survey after a reminder. This is evidenced by the DATA2902 (the advanced stream of DATA2X02) had a response rate of `r scales::percent(nrow(filter(raw_df, which_unit_are_you_enrolled_in == 'DATA2902'))/84)` compared to DATA2002's rate of `r scales::percent(nrow(filter(raw_df, which_unit_are_you_enrolled_in == 'DATA2002'))/675)`. 
Students could also submit the survey multiple times, which may have skewed the data towards an individual if one was to submit many different responses Whilst acknowledging these shortcomings of the sampling method and subsequent response pattern, it is asserted that the survey still offers a moderately random sample of the DATA2X02 cohort.## Sources of BiasThere are some potential biases that may have occurred during this survey.- **Non-response Bias** -- As discussed in @sec-rs, there may have been a non-response bias within the survey. Specifically, we see a difference in response rates between DATA2902 and DATA2002 students. This may have skewed the sample data towards the population of DATA2902 students, rather than DATA2X02 as a whole. This would be an issue if there is a significant difference between the populations of the two units. This is not out of the question, as those who opt to take an advanced stream of a unit may be more willing to challenge themselves and put in more effort into their studies. Moreover, there is the possibly that students do not opt for an advanced unit in order to priorities other aspects of their life, such as work.- **Social desirability/conformity bias** -- Many of the questions asked in the survey have an associated 'socially desirable'. For example, students may, whether consciously or unconsciously, overestimate the amount of hours they exercise, or underestimate the amount of time they spend on social media as these answers come with positive social connotations. Moreover, students may want to conform to the expected answer of the population. An example of this may be the question on whether or not students had experience in R coding. 
The majority of the DATA2X02 cohort would have had experience in R as it was taught in many prerequisite courses, so those who didn't have experience may answer incorrectly to conform with the rest of the cohort.
- **Recall Bias** -- Even if students did not suffer from social desirability or conformity bias, they may have simply not been able to recall the correct answer for a question. An example of this would be someone's WAM. Many students may not know their actual WAM (as it is not reported when getting results or on the online academic transcript), and so they could incorrectly recall it when answering the survey. An instance of this is seen in the WAMs reported, with `r numbers_to_words(raw_df |> filter(what_is_your_wam >= 99) |> nrow())` students reporting a WAM of 99 or above, a value that could potentially be less accurate due to difficulties in recall.

## Possible Improvements

There are many possible improvements which would help to generate more useful data. Many of the questions regarding numeric data did not specify the units in which an answer should be given, or whether the units should be included in the answer. This can be changed by specifying units in the question and only allowing numeric data to be input into the survey rather than free text. One such question was `How much sleep do you get (on avg, per day)?`. A better wording of this question would be `How much sleep (in hours per night) do you get on average?`. This was also an issue for the question `How tall are you?`, where answers were not given in a uniform manner. Rewording to `How tall are you in cm?` would have produced data which required much less cleaning. This extends to `What is your shoe size?`, where students responded with both US and European shoe sizes, which are on very different scales (a US 10 is a 43 European). There were also issues regarding the categorical data.
The question `Would you prefer to study at Fisher Library or SciTech Library?` did not need to include an `Other` response, as any answer of this type would not be answering the question asked. Moreover, the question `Do you work?` did not align with the suggested responses given. This question should have been `What is your current employment status?`. A similar issue was seen in the question `Do you submit assignments on time?`, which should have been `How often do you submit assignments on time?`. Finally, some questions could have included some options and an `Other` response, rather than free text. This was a particular issue for `What brand is your laptop?` and `What is your favourite social media platform?`, where students gave answers in many different forms when referring to the same category, e.g. Apple and Macbook being the same laptop brand. Providing some pre-defined answers would reduce the need for data cleaning.

## Report Outline

This report will focus on the geographical characteristics of the cohort, with the `Postcode` of each response being used as a proxy for where a student lives. Specifically, hypothesis testing will be used to determine the impact of a student's geographical region on a variety of variables.

Statistical Areas Level 4 (SA4s) are the "largest sub-State regions" and "represent labour markets or groups of labour markets within each State and Territory" [@ABS2021], with each SA4 having approximately 300,000 - 500,000 residents in metropolitan areas. These regions will be used to group students into geographical areas with 'geographical, social and economic similarities' [@ABS2021].
@fig-sa4-map is a map made using `Leaflet` [@leaflet2023] which showcases the SA4s of Greater Sydney^[Shape Files used in this map are available [here](https://www.abs.gov.au/AUSSTATS/subscriber.nsf/log?openagent&1270055001_sa4_2016_aust_midmif.zip&1270.0.55.001&Data%20Cubes&7512AFCD3D8FED2DCA257FED001451F6&0&July%202016&12.07.2016&Latest) [@ABS2016]].

```{r warning=FALSE, results=FALSE}
# Read the SA4 shapefile and keep only the Greater Sydney regions
sa4_df <- st_read('Data/1270055001_sa4_2016_aust_shape')
sa4_df_filter <- sa4_df |>
  filter(GCC_NAME16 == 'Greater Sydney')
```

```{r warning=FALSE}
#| label: fig-sa4-map
#| fig-cap: "Map of SA4s in Greater Sydney"
p_popup <- paste0("<strong>Name: </strong>", sa4_df_filter$SA4_NAME16)
leaflet(sa4_df_filter) %>%
  addPolygons(popup = p_popup,
              fillColor = 'lightblue',
              opacity = 1.0,
              weight = 2,
              color = "darkblue",
              fillOpacity = 0.2) %>%
  addTiles()
```

## Data Cleaning

Data cleaning was done in R [@R2023] using the `tidyverse` packages [@tidyverse2019]. The `janitor` package [@janior2023] was initially used to standardise the names of each column so that a reproducible introduction could be made. A new naming convention for the columns was then adopted based on @tarr2023.
Some summary tables have also been created using `gt` [@gt2023].

<details>
<summary>Column Name Conversion Table</summary>

```{r message=FALSE}
raw_df <- readr::read_csv('Data/DATA2x02 survey (2023) (Responses) - Form responses 1.csv')
old_names = colnames(raw_df)
df <- raw_df
new_names = c("timestamp","n_units","task_approach","age","life","fass_unit","fass_major","novel","library","private_health","sugar_days","rent","post_code","haircut_days","laptop_brand","urinal_position","stall_position","n_weetbix","food_budget","pineapple","living_arrangements","height","uni_travel_method","feel_anxious","study_hrs","work","social_media","gender","sleep_time","diet","random_number","steak_preference","dominant_hand","normal_advanced","exercise_hrs","employment_hrs","on_time","used_r_before","team_role","social_media_hrs","uni_year","sport","wam","shoe_size")
# overwrite the old names with the new names:
colnames(df) = new_names
# combine old and new into a data frame:
name_combo = bind_cols(`New Names` = new_names, `Original Names` = old_names)
name_combo |>
  gt::gt() |>
  gt::tab_header(title = "Column Name Cleaning") |>
  gt::tab_options(heading.title.font.weight = 'bolder', column_labels.font.weight = 'bold')
```

</details>

The SA4 name of each row was also joined onto the survey data using a reference table made by @Proctor2023.

```{r warning=FALSE, message=FALSE}
# Build a postcode-to-SA4 lookup, dropping the duplicate mapping for postcode 2232
sa4_postcode_df <- readr::read_csv('Data/sa4_postcode.csv') |>
  select(c(`Postcode`, `SA4 Name`)) |>
  unique() |>
  filter(!((`Postcode` == 2232) & (`SA4 Name` == 'Southern Highlands and Shoalhaven')))
colnames(sa4_postcode_df) <- c('post_code', 'sa4_name')
sa4_postcode_df$post_code <- as.character(sa4_postcode_df$post_code)
# Strip non-digit characters from the reported postcodes before joining
df$post_code <- as.character(gsub("[^0-9]", "", df$post_code))
df <- df |>
  left_join(sa4_postcode_df, by = 'post_code')
df |>
  count(sa4_name) |>
  arrange(desc(n)) |>
  gt::gt() |>
  gt::cols_label(sa4_name = "SA4 Name", n = 'Count of Students') |>
  gt::tab_header(title = "Count of Students by SA4") |>
  gt::tab_options(heading.title.font.weight = 'bolder',
                  column_labels.font.weight = 'bold')
```

The SA4s were further grouped together geographically to collapse some of the groups with lower student counts. @fig-region-map is a map of the groupings of SA4s into regions. A conversion table was generated using `flextable` [@flextable2023].

<details>
<summary>SA4 to Region Conversion Table</summary>

```{r message=FALSE}
north_sydney = c('Sydney - North Sydney and Hornsby', 'Sydney - Ryde', 'Sydney - Northern Beaches')
city_and_eastern_suburbs = c('Sydney - City and Inner South', 'Sydney - Eastern Suburbs')
inner_west = c('Sydney - Inner West', 'Sydney - Parramatta', 'Sydney - Inner South West')
outer_south_west = c('Sydney - Blacktown', 'Sydney - South West', 'Sydney - Sutherland', 'Sydney - Outer West and Blue Mountains', 'Sydney - Outer South West')
riverina_and_central_coast = c('Sydney - Baulkham Hills and Hawkesbury', 'Central Coast', 'Riverina')
# Any SA4 not in the first three groups falls into the final combined region
df <- df |>
  mutate(geographic_regions = case_when(
    sa4_name %in% north_sydney ~ 'North Sydney',
    sa4_name %in% city_and_eastern_suburbs ~ 'City and Eastern Suburbs',
    sa4_name %in% inner_west ~ 'Inner West',
    !is.na(sa4_name) ~ 'Outer South West, Greater Sydney and Regional NSW',
    TRUE ~ NA_character_
  ))
mapping_df <- df |>
  select(geographic_regions, sa4_name) |>
  unique() |>
  drop_na() |>
  arrange(geographic_regions) |>
  mutate(`Region` = geographic_regions, `SA4 Name` = sa4_name) |>
  select(Region, `SA4 Name`)
flextable(mapping_df) |>
  merge_v() |>
  theme_vanilla() |>
  width(2, 4) |>
  width(1, 2)
```

</details>

```{r message=FALSE, warning=FALSE}
#| label: fig-region-map
#| fig-cap: "Map of SA4s grouped into Regions for students in DATA2X02"
sa4_df_in_survey <- sa4_df |>
  filter(SA4_NAME16 %in% df$sa4_name)
sa4_df_in_survey <- sa4_df_in_survey |>
  mutate(geographic_regions = case_when(
    SA4_NAME16 %in% north_sydney ~ 'North Sydney',
    SA4_NAME16 %in% city_and_eastern_suburbs ~ 'City and Eastern Suburbs',
    SA4_NAME16 %in% inner_west ~ 'Inner West',
    !is.na(SA4_NAME16) ~ 'Outer South West, Greater Sydney and Regional NSW',
    TRUE ~ NA_character_
  ))
factpal <- colorFactor(c('darkblue', 'darkgreen', 'darkred', 'purple'), sa4_df_in_survey$geographic_regions)
p_popup <- paste0("<strong>Name: </strong>", sa4_df_in_survey$SA4_NAME16)
leaflet(sa4_df_in_survey) %>%
  addPolygons(popup = p_popup,
              fillColor = ~factpal(geographic_regions),
              opacity = 1.0,
              weight = 2,
              color = ~factpal(geographic_regions),
              fillOpacity = 0.1) %>%
  addTiles() %>%
  addLegend("bottomleft", pal = factpal, values = ~geographic_regions, title = 'Region')
```

<br>

A flagging column was made that identified if someone travelled by car.

```{r}
# "Drive" if the travel method mentions a car, NA if missing, otherwise "Other"
df <- df |>
  mutate(car_flag = ifelse(str_detect(uni_travel_method, "Car"), "Drive", ifelse(is.na(uni_travel_method), NA, "Other")))
df |>
  count(car_flag) |>
  gt::gt() |>
  gt::cols_label(car_flag = "Does the Student Drive to University?", n = 'Count of Students') |>
  gt::tab_header(title = "Count of Students by Whether or Not they Travel by Car") |>
  gt::tab_options(heading.title.font.weight = 'bolder', column_labels.font.weight = 'bold')
```

<br>

The employment hours of each respondent were binned into categories.

```{r}
bin_ranges <- c(0, 1, 10, Inf)
bin_labels <- c("0", "1-10", "11+")
# Create a new column with binned values
# With these breaks the intervals are [0, 1], (1, 10] and (10, Inf]
df$employment_hrs_bin <- cut(df$employment_hrs, breaks = bin_ranges, labels = bin_labels, include.lowest = TRUE)
df |>
  count(employment_hrs_bin) |>
  gt::gt() |>
  gt::cols_label(employment_hrs_bin = "Employment Hours", n = 'Count of Students') |>
  gt::tab_header(title = "Count of Students by Employment Hours") |>
  gt::tab_options(heading.title.font.weight = 'bolder', column_labels.font.weight = 'bold')
```

<br>

Outliers of WAM were set to `NA`, as these may be international students who have a different WAM system or people who do not know their WAM.
Outliers were defined as values lying outside $\pm 3$ standard deviations from the mean.

```{r warning=FALSE}
#| label: fig-wam-hist
#| fig-cap: "Histogram of students' WAM with outliers removed"
# Set values more than 3 standard deviations from the mean to NA
remove_outlier <- function(vec) {
  threshold1 = mean(vec[!is.na(vec)]) + 3 * sd(vec[!is.na(vec)])
  threshold2 = mean(vec[!is.na(vec)]) - 3 * sd(vec[!is.na(vec)])
  vec[vec > threshold1 | vec < threshold2] <- NA
  return(vec)
}
df[['wam']] <- remove_outlier(df[['wam']])
df %>%
  ggplot(aes(x = wam)) +
  geom_histogram(bins = 20, fill = "steelblue1", color = "black") +
  labs(x = "WAM", y = "Frequency", title = "Histogram of Students' WAM with Outliers Removed") +
  theme(legend.position = "none",
        plot.background = element_rect(fill = "#ffffff", linewidth = 0),
        axis.title = element_text(face = "bold"),
        plot.title = element_text(face = "bold", size = 13, hjust = 0.5))
```

# Hypothesis Testing

## Does Living in Sydney's City and Eastern Suburbs Influence if Students Drive to University?

Given the University of Sydney is located in Sydney's City and Eastern Suburbs, it is suspected that students may opt for public transport, rather than driving to university, if they live close to the university.
This is of interest as the effective carbon emissions of the university can be reduced if more students use public transport.

<center>

```{r}
#| label: fig-drive_graph
#| fig-cap: "Proportion bar chart of travel method for different regions"
car_df <- df |>
  select(c(geographic_regions, car_flag)) |>
  mutate(geographic_regions = ifelse(geographic_regions == 'City and Eastern Suburbs', 'City and Eastern Suburbs', 'Other')) |>
  drop_na() |>
  mutate(`Travel Method` = car_flag)
car_df |>
  ggplot() +
  aes(x = geographic_regions, fill = `Travel Method`) +
  # Creates a proportion bar chart
  geom_bar(colour = "black",
           linewidth = 0.5,
           position = "fill") +
  # Changes the axis label and title
  labs(y = "Proportion of Travel Method",
       x = "Region",
       title = "Proportion of Students who Drive to University \n based on Geographical Location",
       fill = "Travel Method") +
  # Changes the aesthetics of the chart
  theme(plot.background = element_rect(fill = "#ffffff", linewidth = 0),
        legend.background = element_rect(fill = "#ffffff", linewidth = 0),
        panel.border = element_rect(colour = "black", fill = NA),
        legend.box.background = element_rect(colour = "black"),
        axis.title = element_text(face = "bold"),
        plot.title = element_text(face = "bold", size = 14, hjust = 0.5)) +
  scale_y_continuous(labels = scales::percent) +
  scale_fill_brewer(palette = "Set2")
```

</center>

A $\chi^2$-test for independence was performed at the $\alpha = 0.05$ level on the below contingency table.
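
For reference, a $\chi^2$-test for independence compares each observed count with the count expected under independence, $e_{ij} = (\text{row}_i \text{ total} \times \text{column}_j \text{ total}) / n$. The sketch below computes these by hand for a hypothetical $2 \times 2$ table and checks them against `chisq.test`. The counts are invented for illustration and are not the survey data.

```r
# Hypothetical counts for illustration only (not the survey data)
observed <- matrix(c(12, 48,
                     60, 80),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(Region = c("City and Eastern Suburbs", "Other"),
                                   Travel = c("Drive", "Other")))

# Expected counts under independence: (row total * column total) / n
n <- sum(observed)
expected <- outer(rowSums(observed), colSums(observed)) / n

# Test statistic: sum over all cells of (observed - expected)^2 / expected
t0 <- sum((observed - expected)^2 / expected)

# chisq.test() without continuity correction uses the same quantities
fit <- chisq.test(observed, correct = FALSE)
all.equal(unname(fit$statistic), t0)
all.equal(fit$expected, expected, check.attributes = FALSE)
```

The assumption that every expected count is at least 5 can then be checked with `all(expected >= 5)`.
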
A Monte-Carlo simulation of size $6000$ was used to calculate the $p$-value.

```{r}
contingency_table <- table(car_df$geographic_regions, car_df$car_flag) |>
  as.data.frame.matrix()
contingency_table$`Region` = c('City and Eastern Suburbs', 'Other')
contingency_table |>
  gt::gt() |>
  gt::cols_move_to_start(columns = c(`Region`)) |>
  gt::tab_spanner(label = "Method of Travel", columns = 1:2) |>
  gt::tab_header(title = "Count of Students by Method of Travel") |>
  gt::tab_options(heading.title.font.weight = 'bolder', column_labels.font.weight = 'bold')
```

```{r}
# Fix the seed so the simulated p-value is reproducible
set.seed(1)
test <- chisq.test(table(car_df$car_flag, car_df$geographic_regions), simulate.p.value = TRUE, B = 6000)
```

::: {.callout-note}
## $\chi^2$-test for independence

1. **Hypothesis** -- $H_0$: The method of travel of a student is independent of living in Sydney's City and Eastern Suburbs. $H_1$: There is some dependence between method of travel and living in Sydney's City and Eastern Suburbs.
2. **Assumptions** -- The observations are independent, and the expected cell counts are greater than or equal to 5. The observations are assumed to be independent, on the basis that each student completed the survey once (although, as noted in @sec-rs, multiple submissions were possible). There were `r numbers_to_words(sum(test$expected < 5))` expected cell counts less than 5, so these assumptions hold.
3. **Test Statistic** -- $$T = \sum_{i=1}^2 \sum_{j=1}^2 \frac{\left(Y_{i j}-e_{i j}\right)^2}{e_{i j}}$$ Under $H_0$, $T\sim \chi^2_1$ approximately.
4. **Observed Test Statistic** -- $t_0=$ `r round(unname(test$statistic), 4)`.
5. **p-value** -- The proportion of simulated test statistics that were as or more extreme than $t_0$ was $p=$ `r as.character(round(test$p.value, 5))`.
6. **Decision** -- As the $p$-value was $<\alpha$, we can reject $H_0$.
This implies that there is some dependence between method of travel and living in Sydney's City and Eastern Suburbs.

:::

## Is academic performance significantly better for students living in North Sydney compared to those in the Inner West?

A student's [[WAM]{style="text-decoration: underline;"}]{id='wam'} is one measure of academic performance. Knowing whether WAM is related to where students live could be useful, as it could allow the University to provide targeted academic help.

```{r}
# Keep only the two regions being compared and split WAM by region
wam_df <- df |>
  filter(geographic_regions %in% c('Inner West', 'North Sydney')) |>
  select(geographic_regions, wam) |>
  drop_na()
inner_west_wam <- filter(wam_df, geographic_regions == "Inner West")$wam
north_sydney_wam <- filter(wam_df, geographic_regions == "North Sydney")$wam
```

A Welch two-sample one-sided $t$-test at the $\alpha = 0.05$ level was conducted to determine if the mean WAM of students in North Sydney is larger than that of students living in the Inner West. Initial EDA suggests this may be the case, with the mean WAM of the two groups being `r round(mean(north_sydney_wam),1)` and `r round(mean(inner_west_wam),1)` respectively.
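
For reference, the Welch statistic and its estimated degrees of freedom (the Welch-Satterthwaite approximation) can be computed by hand and compared with `t.test`. The sketch below uses simulated marks, so the group sizes, means and standard deviations are assumptions made for illustration and are not the survey data.

```r
set.seed(2023)

# Simulated marks for illustration only (not the survey data)
ns <- rnorm(40, mean = 75, sd = 8)   # stands in for North Sydney
iw <- rnorm(55, mean = 72, sd = 11)  # stands in for Inner West

# Per-group variance of the sample mean: S^2 / n
v_ns <- var(ns) / length(ns)
v_iw <- var(iw) / length(iw)

# Welch test statistic
t0 <- (mean(ns) - mean(iw)) / sqrt(v_ns + v_iw)

# Welch-Satterthwaite degrees of freedom
nu <- (v_ns + v_iw)^2 /
  (v_ns^2 / (length(ns) - 1) + v_iw^2 / (length(iw) - 1))

# One-sided p-value, P(t_nu >= t0)
p <- pt(t0, df = nu, lower.tail = FALSE)

# t.test() performs the Welch test by default (var.equal = FALSE)
fit <- t.test(ns, iw, alternative = "greater")
all.equal(unname(fit$statistic), t0)
all.equal(unname(fit$parameter), nu)
all.equal(fit$p.value, p)
```

Because $\nu$ depends on the sample variances, it is generally not an integer, which is why the reported degrees of freedom are rounded.
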
We can also generate a QQ-plot of students' WAM, which suggests the variable is approximately normally distributed in each group, as the points lie close to the reference line.

```{r}
#| layout-nrow: 1
#| label: fig-box-plot
#| fig-cap: 
#|   - "A: Box plot of WAMs of students from the Inner West and North Sydney"
#|   - "B: QQ-plot of WAMs of students from the Inner West and North Sydney"
wam_df |>
  ggplot() +
  aes(y = wam, color = geographic_regions, x = geographic_regions, fill = geographic_regions) +
  geom_boxplot() +
  scale_color_manual(values = c('darkgreen', 'darkred')) +
  scale_fill_manual(values = c(rgb(217/255, 227/255, 215/255), rgb(230/255, 215/255, 214/255))) +
  theme(legend.position = "none") +
  # Changes the aesthetics of the chart
  theme(plot.background = element_rect(fill = "#ffffff", linewidth = 0),
        panel.border = element_rect(colour = "black", fill = NA),
        legend.box.background = element_rect(colour = "black"),
        axis.title = element_text(face = "bold"),
        plot.title = element_text(face = "bold", size = 14, hjust = 0.5)) +
  # Changes the axis label and title
  labs(y = "WAM",
       x = "Region",
       title = "A: Grouped Box Plot of WAM by Region")
ggqqplot(wam_df, x = "wam", facet.by = "geographic_regions", color = "geographic_regions", palette = c('darkgreen', 'darkred'), legend = 'none', title = "B: QQ-plot of WAM") +
  # Changes the aesthetics of the chart
  theme(plot.background = element_rect(fill = "#ffffff", linewidth = 0),
        panel.border = element_rect(colour = "black", fill = NA),
        legend.box.background = element_rect(colour = "black"),
        axis.title = element_text(face = "bold"),
        plot.title = element_text(face = "bold", size = 14, hjust = 0.5))
# Welch test (unequal variances is the default) and per-group normality checks
test <- t.test(north_sydney_wam, inner_west_wam, alternative = 'greater')
shapiro1 <- shapiro.test((wam_df |> filter(geographic_regions == 'Inner West'))$wam)
shapiro2 <- shapiro.test((wam_df |> filter(geographic_regions == 'North Sydney'))$wam)
degrees_of_freedom <- test$parameter
```

<br>

::: {.callout-note}
## Welch two-sample one-sided $t$-test

1. **Hypothesis** -- $H_0$: The mean WAM of students from North Sydney, $\mu_{NS}$, equals the mean WAM of students from the Inner West, $\mu_{IW}$. $H_1$: $\mu_{NS}$ is greater than $\mu_{IW}$.
2. **Assumptions** -- The observations in each group are independently and identically distributed $\mathcal{N}(\mu_{i}, \sigma_{i}^2)$ for $i=NS, IW$, and the two groups are independent of each other. The observations are assumed to be independent, on the basis that each student completed the survey once (although, as noted in @sec-rs, multiple submissions were possible). The above QQ-plot (@fig-box-plot-2) suggests that WAM is approximately normally distributed. Moreover, using a Shapiro-Wilk test, both groups were consistent with $X\sim\mathcal{N}(\mu_{i}, \sigma_{i}^2)$, with $p$-values of `r round(shapiro1$p.value, 3)` for `Inner West` and `r round(shapiro2$p.value, 3)` for `North Sydney`.
3. **Test Statistic** -- $$T=\frac{\overline{NS}-\overline{IW}}{\sqrt{\frac{S_{ns}^2}{n_{ns}}+\frac{S_{iw}^2}{n_{iw}}}}$$ Here, $S_{ns}^2$ and $S_{iw}^2$ are the sample variances, and $n_{ns}$ and $n_{iw}$ the sample sizes, of the $NS$ (North Sydney) and $IW$ (Inner West) samples. Under $H_0$, $T\sim t_{\nu}$ approximately, where $\nu=$ `r round(degrees_of_freedom,2)` as estimated from the data.
4. **Observed Test Statistic** -- $t_0=$ `r round(unname(test$statistic), 4)`
5. **p-value** -- $p = P\left(t_\nu \geq t_0\right)=$ `r as.character(round(test$p.value, 3))`
6. **Decision** -- As the $p$-value was $<\alpha$, we can reject $H_0$. This implies that the mean WAM of students from North Sydney is significantly greater than that of students who live in the Inner West.

:::

## Does a student's Region have a significant influence on how many hours they work?

Initial exploration of the data set suggested that there was a non-uniform distribution of working hours across different regions.
The proportion of students working between one and 10 hours per week was relatively similar, and the main differences were observed when comparing the proportions of students working no hours or 11 or more hours a week.

<center>

```{r}
#| label: fig-hours_graph
#| fig-cap: "Proportion bar chart of hours worked for different regions."
employment_df <- df |>
  select(c(geographic_regions, employment_hrs_bin)) |>
  drop_na() |>
  mutate(`Employment Hours per Week` = employment_hrs_bin)
employment_df |>
  ggplot() +
  aes(x = geographic_regions, fill = `Employment Hours per Week`) +
  # Creates a proportion bar chart
  geom_bar(colour = "black",
           linewidth = 0.5,
           position = "fill") +
  # Changes the axis label and title
  labs(y = "Proportion of Hours Worked Category",
       x = "Region",
       title = "Proportion of Students in Hours Worked by Region",
       fill = "Employment Hours per Week") +
  # Changes the aesthetics of the chart
  theme(plot.background = element_rect(fill = "#ffffff", linewidth = 0),
        legend.background = element_rect(fill = "#ffffff", linewidth = 0),
        panel.border = element_rect(colour = "black", fill = NA),
        legend.box.background = element_rect(colour = "black"),
        axis.title = element_text(face = "bold"),
        plot.title = element_text(face = "bold", size = 14, hjust = 0.5)) +
  scale_y_continuous(labels = scales::percent) +
  scale_x_discrete(labels = c("City and \n Eastern Suburbs", "Inner West", "North Sydney", "Outer South West, \n Greater Sydney and \n Regional NSW")) +
  scale_fill_brewer(palette = "Set2")
```

</center>

A $\chi^2$-test for independence was performed at the $\alpha = 0.05$ level on the below contingency table.
No continuity correction was applied, as Yates's correction in `chisq.test` only applies to $2 \times 2$ tables.

```{r}
contingency_table <- table(employment_df$geographic_regions, employment_df$employment_hrs_bin) |>
  as.data.frame.matrix()
contingency_table$`Region` = c('City and Eastern Suburbs', 'Inner West', 'North Sydney', 'Outer South West,\n Greater Sydney and \n Regional NSW')
contingency_table |>
  gt::gt() |>
  gt::cols_move_to_start(columns = c(`Region`)) |>
  gt::tab_spanner(label = "Hours Worked", columns = 1:3) |>
  gt::tab_header(title = "Count of Students by Hours Worked") |>
  gt::tab_options(heading.title.font.weight = 'bolder', column_labels.font.weight = 'bold')
# Rows are the four regions and columns are the three hours-worked bins
test <- chisq.test(table(employment_df$geographic_regions, employment_df$employment_hrs_bin))
degrees_of_freedom <- test$parameter
```

::: {.callout-note}
## $\chi^2$-test for independence

1. **Hypothesis** -- $H_0$: The number of hours worked by a student is independent of their region. $H_1$: There is some dependence between the number of hours worked and region.
2. **Assumptions** -- The observations are independent, and the expected cell counts are greater than or equal to 5. The observations are assumed to be independent, on the basis that each student completed the survey once (although, as noted in @sec-rs, multiple submissions were possible). There were `r numbers_to_words(sum(test$expected < 5))` expected cell counts less than 5, so these assumptions hold.
3. **Test Statistic** -- $$T = \sum_{i=1}^4 \sum_{j=1}^3 \frac{\left(Y_{i j}-e_{i j}\right)^2}{e_{i j}}$$ Under $H_0$, $T\sim \chi^2_{`r degrees_of_freedom`}$ approximately.
4. **Observed Test Statistic** -- $t_0=$ `r round(unname(test$statistic), 4)`
5. **p-value** -- $p=P(\chi^2_{`r degrees_of_freedom`} \geq t_0)<0.001$.
6. **Decision** -- As the $p$-value was $<\alpha$, we can reject $H_0$.
This implies that there is some dependence between hours worked in a week and a student's region.

:::

# Conclusion

The geographic characteristics of the cohort have been investigated in this report by grouping DATA2X02 students into regions and performing hypothesis tests on different variables.

Throughout the analysis, it was seen that geographic regions played a statistically significant role in the distribution of `Travel Method`, `WAM` and `Employment Hours per Week`. Specifically, it was found that the method of travel of students is dependent on whether or not they live in Sydney's City and Eastern Suburbs, the mean WAM of students from North Sydney is greater than that of students living in the Inner West, and employment hours per week are dependent on region.

Future investigation into DATA2X02 cohorts may look to validate these results (to see if they are consistent across all DATA2X02 cohorts or specific to the 2023 cohort), as well as source more specific geographical information about students, rather than using their postcode.